© Krzysztof Najman, Kamila Migdał-Najman, Katarzyna Raca, Agata Majkowska. Article available under the CC BY-SA 4.0 licence.
Textual data (textposts) account for a significant portion of all data posted on the Internet. One piece of information that researchers seek to obtain about the authors of textposts is their age, which is not always made public, yet is important for marketing, social and economic research. Language research shows that representatives of different age groups tend to use distinct vocabulary and grammatical forms. Presumably, the formatting of a textpost and the degree of correctness of the text itself may also differentiate user age groups. The aim of the research presented in this article is to use the elements typically eliminated from texts during text mining, such as emoticons, punctuation marks and words that are not content carriers (stopwords), to distinguish the age groups of the authors of posts on Twitter (currently X). The study analysed nearly 3 million tweets in English posted before July 2020. The results show that these textpost elements differentiate the age groups only to a small extent. The youngest users stood out the most, owing to the specific language characteristics of their textposts.
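As a rough illustration of what it means to treat the usually discarded elements as features, the sketch below counts emoticons, punctuation marks and stopwords in a single textpost. It is not the procedure used in the study: the stopword list, the emoticon pattern and the function name `textpost_features` are all assumptions made for the example.

```python
import re
import string
from collections import Counter

# Illustrative stopword list (an assumption; not the list used in the study).
STOPWORDS = {"a", "an", "the", "and", "or", "but", "i", "you",
             "it", "is", "to", "of", "in", "so"}

# Simple pattern for a few Western-style emoticons such as :) ;-) :D :(
# (an assumption; real emoticon inventories are much larger).
EMOTICON_RE = re.compile(r"[:;=][\-']?[)(DPp]")

# Words are runs of ASCII letters and apostrophes (a simplification).
WORD_RE = re.compile(r"[A-Za-z']+")


def textpost_features(text: str) -> dict:
    """Count elements that text mining pipelines usually throw away."""
    emoticons = EMOTICON_RE.findall(text)
    # Remove emoticons first so their characters are not also
    # counted as punctuation or as parts of words.
    stripped = EMOTICON_RE.sub(" ", text)
    words = [w.lower() for w in WORD_RE.findall(stripped)]
    punct = Counter(ch for ch in stripped if ch in string.punctuation)
    n_words = len(words)
    n_stop = sum(1 for w in words if w in STOPWORDS)
    return {
        "n_words": n_words,
        "n_emoticons": len(emoticons),
        "n_punctuation": sum(punct.values()),
        # Share of stopwords among all words; 0.0 for an empty post.
        "stopword_share": n_stop / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    print(textpost_features("So happy it is the weekend :) !!!"))
```

Feature vectors of this kind, computed per post or aggregated per author, could then be compared across age groups with whatever grouping or classification method is preferred.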
Twitter, text mining, user age
C38, C88, M30